RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.
RStudio projects are associated with R working directories. To create a new project, use the Create Project command (available on the Projects menu and on the global toolbar).
When working in RStudio with regular R scripts, use # to add comments to your code. Additionally, any comment line that ends with at least four dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.
Note that the line can start with any number of pound signs (#) so long as it ends with four or more -, =, or # characters.
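The rule above can be illustrated with a short script. This is only a sketch: the section titles and the variables x, y, and z are made up for illustration, and each of the three comment lines should be picked up by RStudio as a code section.

```r
# Load data -------------------------------------------------

x = 1

# Process data ==============================================

y = x + 1

### Save results ############################################

z = y + 1
```

The third header starts with three pound signs, which is allowed because only the trailing characters decide whether a line is a section header.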
To navigate between code sections, use the Jump To menu available at the bottom of the editor.
RStudio supports both automatic and user-defined folding for regions of code. Code folding lets you show and hide blocks of code, which makes it easier to navigate your source file and focus on the coding task at hand. To expand a folded region, click either the arrow in the gutter or the icon that overlays the folded code.
To indent or reformat the code, use the Reindent Lines and Reformat Code commands on the Code menu.
It’s good practice to stick to one naming convention throughout the code. A convenient convention is the so-called camel case, where names of variables, constants, and functions are constructed by capitalizing each component of the name, e.g.:
calcStatsOfDF - function to calculate stats
nIteration - prefix n to indicate an integer variable
fPatientAge - prefix f to indicate a float variable
sPatientName - prefix s to indicate a string variable
vFileNames - v for a vector
lColumnNames - l for a list
This document is written as an R Notebook. It allows you to create publish-ready documents with text, graphics, and interactive plots. It can be saved as an HTML, PDF, or Word document.
An R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input. The text is formatted using R Markdown.
library(data.table)

# Test data in the wide format: one row per baby, one column per year
dtBabies = data.table(name     = c("Jackson Smith", "Emma Williams", "Liam Brown", "Ava Wilson"),
                      gender   = c("M", "F", "M", "F"),
                      year2011 = c(74.69, NA, 88.24, 81.77),
                      year2012 = c(84.99, NA, NA, 96.45),
                      year2013 = c(91.73, 75.74, 101.83, NA),
                      year2014 = c(95.32, 82.49, 108.23, NA),
                      year2015 = c(107.12, 93.73, 119.01, 105.65))
dtBabies
summary(dtBabies)
## name gender year2011 year2012
## Length:4 Length:4 Min. :74.69 Min. :84.99
## Class :character Class :character 1st Qu.:78.23 1st Qu.:87.86
## Mode :character Mode :character Median :81.77 Median :90.72
## Mean :81.57 Mean :90.72
## 3rd Qu.:85.00 3rd Qu.:93.58
## Max. :88.24 Max. :96.45
## NA's :1 NA's :2
## year2013 year2014 year2015
## Min. : 75.74 Min. : 82.49 Min. : 93.73
## 1st Qu.: 83.73 1st Qu.: 88.91 1st Qu.:102.67
## Median : 91.73 Median : 95.32 Median :106.39
## Mean : 89.77 Mean : 95.35 Mean :106.38
## 3rd Qu.: 96.78 3rd Qu.:101.78 3rd Qu.:110.09
## Max. :101.83 Max. :108.23 Max. :119.01
## NA's :1 NA's :1
The test data frame is in the wide format. Here, we convert it to long format using function melt. The key is to provide the names of identification (parameter id.vars) and measure variables (parameter measure.vars). If none are provided, melt will try to guess them automatically, which sometimes may result in a wrong conversion.
Both variables can be provided as explicit strings with column names, or as column numbers.
The original data frame contained missing values. The function melt has an option na.rm=T to omit them in the long-format table.
dtBabiesLong = melt(dtBabies,
                    id.vars = c('name', 'gender'),
                    measure.vars = 3:7,
                    variable.name = 'year',
                    value.name = 'weight',
                    na.rm = TRUE)
dtBabiesLong
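As noted above, the measure variables can also be given by name instead of by position. The following is a minimal sketch under stated assumptions: the table dtWide and its columns are hypothetical, and the measure columns are assumed to share the prefix "year" so that they can be selected with grep.

```r
library(data.table)

# Small hypothetical wide table: one row per subject, one column per year
dtWide = data.table(id       = c("a", "b"),
                    year2011 = c(1.5, NA),
                    year2012 = c(2.5, 3.5))

# Collect the names of the measure columns instead of hard-coding positions
vMeasCols = grep("^year", names(dtWide), value = TRUE)

# Convert to the long format; na.rm = TRUE drops the missing measurement
dtLong = melt(dtWide,
              id.vars       = "id",
              measure.vars  = vMeasCols,
              variable.name = "year",
              value.name    = "value",
              na.rm         = TRUE)

dtLong
```

Selecting columns by name keeps the conversion correct even if the column order of the input table changes.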
The function dcast from the reshape2 package converts from long to wide format. The function has a so-called formula interface that specifies the combination of variables that uniquely identifies a row.
Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.
dtBabiesWide = dcast(dtBabiesLong,
                     name + gender ~ year,
                     value.var = 'weight')
dtBabiesWide
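The way missing combinations turn into NAs can be seen on a smaller table. This is a sketch only: dtLong below is a hypothetical table, and it uses the dcast method and its fill argument from the data.table package.

```r
library(data.table)

# Hypothetical long table: subject "b" has no record for 2011
dtLong = data.table(id    = c("a", "a", "b"),
                    year  = c("2011", "2012", "2012"),
                    value = c(1.5, 2.5, 3.5))

# One row per id, one column per year; the missing combination becomes NA
dtWide = dcast(dtLong, id ~ year, value.var = "value")
dtWide

# The fill argument replaces missing cells with a chosen value instead
dtWideFilled = dcast(dtLong, id ~ year, value.var = "value", fill = 0)
dtWideFilled
```

Whether to keep the NAs or fill them depends on the analysis; a filled value such as 0 is indistinguishable from a real measurement afterwards.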
In the above example, you may have noticed that the formula interface of dcast requires providing column names explicitly. Hard-coding them this way in the script is potentially dangerous, for example when column names change for some reason. A much handier way is to store the column names at the beginning of the script, where it’s easy to change them, and then use variables with those names in the code.
lCol = list()
lCol$meas = 'weight'
lCol$time = 'year'
lCol$group = c('name', 'gender')
lCol
## $meas
## [1] "weight"
##
## $time
## [1] "year"
##
## $group
## [1] "name" "gender"
Now we build a string from column names stored in lCol list using paste0 function:
sFormula = paste0(lCol$group[1], '+', lCol$group[2], '~', lCol$time)
sFormula
## [1] "name+gender~year"
Finally, we use the string as a formula using as.formula function:
dcast(dtBabiesLong,
      as.formula(sFormula),
      value.var = lCol$meas)
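The paste0 call above addresses the grouping columns by index, so it works only for exactly two of them. A possible variant, sketched here with a hypothetical list lColNames that mirrors lCol, collapses any number of grouping columns into the left-hand side of the formula.

```r
# Hypothetical column-name list, mirroring lCol from the text
lColNames = list(meas  = "weight",
                 time  = "year",
                 group = c("name", "gender"))

# Join all grouping columns with '+', then append the time column after '~'
sFormulaAny = paste(paste(lColNames$group, collapse = " + "),
                    lColNames$time,
                    sep = " ~ ")
sFormulaAny

# as.formula turns the string into a formula object
frmAny = as.formula(sFormulaAny)

# all.vars lists the variables the formula refers to
all.vars(frmAny)
```

With this construction, adding a third grouping column only requires changing the list at the top of the script.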
ggplot2 is a powerful plotting package that requires data in the long format. Let’s plot weight over time.
library(ggplot2)
ggplot2::ggplot(dtBabiesLong, aes(x = year, y = weight)) +
  geom_line()
Oops, that doesn’t look good… The reason is that the plotting function doesn’t know how to link the points. The logical way to link them is by the name column.
Here we add the group option to the ggplot aesthetics (aes) to avoid the mistake above.
Also, to avoid hard-coding column names, we use aes_string instead of aes.
To produce facets per gender, we use the function facet_wrap. It uses a formula interface, as dcast does, so we need to build the formula from a string.
sFormula2 = paste0('~', lCol$group[2])
sFormula2
## [1] "~gender"
p1 = ggplot2::ggplot(dtBabiesLong, aes_string(x = lCol$time, y = lCol$meas, group = lCol$group[1])) +
  geom_line() +
  geom_point() +
  facet_wrap(as.formula(sFormula2)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p1
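Since aes_string has been soft-deprecated in recent ggplot2 releases (3.0.0 and later), an alternative worth knowing is the .data pronoun inside a regular aes call. The sketch below is not the approach used in this document: the data frame dfLong and the variables sColTime, sColMeas, and sColGroup are hypothetical stand-ins for dtBabiesLong and the entries of lCol.

```r
library(ggplot2)

# Hypothetical long-format data, standing in for dtBabiesLong
dfLong = data.frame(name   = rep(c("A", "B"), each = 2),
                    year   = rep(c("year2014", "year2015"), times = 2),
                    weight = c(95.3, 107.1, 82.5, 93.7))

# Column names stored as strings
sColTime  = "year"
sColMeas  = "weight"
sColGroup = "name"

# .data[[...]] looks up a column by the string stored in a variable
p2 = ggplot(dfLong, aes(x     = .data[[sColTime]],
                        y     = .data[[sColMeas]],
                        group = .data[[sColGroup]])) +
  geom_line() +
  geom_point() +
  labs(x = sColTime, y = sColMeas)  # set axis titles explicitly
```

The axis titles are set with labs so that they show the column names regardless of how the installed ggplot2 version labels .data expressions.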
Making an interactive plot from a ggplot object is extremely easy: just use the ggplotly function from the amazing plotly package. The interactive plot will remain in the HTML document knitted from the R Notebook.
p_load(plotly)
ggplotly(p1)
Let’s write a function to calculate statistics of a data frame. In the simplest case, the function will calculate the mean of a single column of a data frame.
We will expand the function to calculate the mean by group, and to calculate robust statistics, i.e. the median instead of the mean.
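The function will need the following input parameters:
inDt - input data table in the long format
inMeasVar - name of the measurement column
inGroupName - name of the grouping column (default NULL)
inRobust - if TRUE, the function calculates the median instead of the mean (default FALSE)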
calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = FALSE) {
  if (inRobust) {
    # Robust statistics: median of the measurement column
    outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
  } else {
    # Default: mean of the measurement column
    outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
  }
  return(outDt)
}
Since column names will be provided to our function as string parameters, we cannot hard-code them inside of the function. Therefore, we use function get to use the string stored in the variable inMeasVar as the column name.
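The role of get can be seen in isolation on a small table. This is a minimal sketch: dtDemo and sColName are hypothetical names, and the .SDcols variant is shown only as one common data.table alternative, not as the approach taken in calcStats.

```r
library(data.table)

# Hypothetical table and column name, for illustration only
dtDemo = data.table(weight = c(80, 90, 100),
                    gender = c("F", "M", "F"))
sColName = "weight"

# get() resolves the string to the column of that name
dtMean = dtDemo[, .(meanMeas = mean(get(sColName)))]
dtMean

# Alternative: restrict .SD to the named column and apply the function to it
dtMeanSD = dtDemo[, lapply(.SD, mean), .SDcols = sColName]
dtMeanSD
```

Both calls compute the same mean; the first names the result column explicitly, while the second keeps the original column name.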
Calculate the mean of the weight column:
calcStats(dtBabiesLong, 'weight')
Calculate the mean of the weight column by name and gender. Use robust stats:
calcStats(dtBabiesLong, 'weight', inGroupName = c('name', 'gender'), inRobust = T)
With the cursor inside the function, click Menu > Code > Insert Roxygen Skeleton (Shift-Option-Command-R). A special type of comment will be added above the function. You can then add your own description next to each parameter.
#' Calculates stats of a data frame
#'
#' @param inDt Input data table in the long format
#' @param inMeasVar Name of the measurement column
#' @param inGroupName Name of the grouping column (default NULL)
#' @param inRobust If true, the function calculates median instead of the mean (default False)
#'
#' @return Data table with summary stats
#' @export
#' @import data.table
#'
#' @examples
#' # example usage
calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = FALSE) {
  if (inRobust) {
    outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
  } else {
    outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
  }
  return(outDt)
}